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^ ■ Abstract 

, We compare the efficiency of four different algorithms to compute the overlap Dirac operator, 

!>■ ; 

both for the speed, i.e., time required to reach a desired numerical accuracy, and for the adaptability, 
' i.e., the scaling of speed with the condition number of the (square of the) Wilson Dirac operator. 

•/^ ■ Although orthogonal polynomial expansions give good speeds at moderate condition number, they 
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are highly non-adaptable. One of the rational function expansions, the Zolotarev approximation, 
is the fastest and is adaptable. The conjugate gradient approximation is adaptable, self-tuning, 
and nearly as fast as the ZA. 
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I. INTRODUCTION 



A lack of consistent definition of chiral fermions on the lattice has hampered definitive and 
convincing investigations of chiral aspects of quantum chromodynamics (QCD) until now. 
Thus important physics issues, such as the spontaneous breaking of the chiral symmetry 
at low temperatures and its restoration at finite temperature, have remained hostages to 
technical questions such as the fine-tuning of the bare quark mass (Wilson fermions) or the 
precise number of massless flavours (staggered fermions). Recent developments in defining 
exact chiral symmetry on the lattice have therefore created exciting prospects of studying 
an enormous amount of physics in a cleaner manner from first principles. However, the 
corresponding Dirac operators are much more complicated. Without good control of the 
algorithms needed to deal with them, one is unlikely to derive the full benefit of their better 
chiral properties. Our goal in this paper is to evaluate the efficiency of the most widely used, 
or most promising, algorithms. By efficiency we mean both the speed of the algorithm, which 
is measured by the computer time required to achieve a certain accuracy in the solution, and 
the adaptability, which is measured by how the speed scales as the problem becomes harder. 
This study is made for various values of the required accuracy along with the corresponding 
analysis on the accuracy obtained for the expected properties of the resulting Dirac operator 
such as the Ginsparg- Wilson relation, the central circle relation, 75 hermiticity or normality. 
In particular, we have observed that these properties can be satisfied accurately only if the 
sign computation of the Wilson Dirac operator has high enough precision. 

A. The overlap Dirac operator 

One version of chiral fermions on the lattice is the overlap formalism. The overlap Dirac 

n 

operator {D) is defined in terms of the Wilson-Dirac operator (D^) by the relation 




(1) 



In this paper we shall use the shorthand notation 



M = Dla 



w 



(2) 



The Wilson-Dirac operator D^^ (for lattice spacing a 



1) is given by 




(3) 
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where and d* are (gauge covariant) forward and backward difference operators respec- 
tively. It has been shown that as long as the mass m is in the range —2 < m < 0, the 
above overlap Dirac operator is well-defined, and corresponds to a single massless fermion. 
Furthermore, it satisfies the Ginsparg- Wilson relation 

75D + D75 = Dj,D, (4) 

which leads to a good definition of chirality on the lattice and has been shown to correspond 
to an exact chiral symmetry on the lattice. 

The overlap Dirac operator, D, enjoys many nice properties in addition to the Ginsparg- 
Wilson relation in Eq. (@]). In particular, it satisfies 75-Hermiticity — 

= 75/^75. (5) 

Together with the Ginsparg- Wilson relation, this implies that D is normal, i.e., 

[D,D^=0. (6) 

Normality clearly means that D and D"^ have the same eigenvectors. Eqs. ()4I5|) also imply 

D + D^ = DW, (7) 

and hence the eigenvalues of D and lie on the unit circle centered at unity on the real 
line, implying that — 1 is unitary. We define measures of numerical errors on each of these 
quantities, and relations between them in Section |nj 



B. Numerical algorithms 

All computations of hadronic correlators involve the determination of the fermion propa- 
gator D~^, and need a nested series of two matrix iterations for their evaluation, since each 
step in the numerical inversion of D involves the evaluation of M~^/^. This squaring of effort 
makes a study of QCD with overlap quarks very expensive. 

This problem defines for us the properties that an efficient algorithm to deal with M~^/^ 
must have. First, it should achieve a given desired accuracy as quickly as possible. The 
need for accuracy is clear: the accuracy to which the Ginsparg- Wilson relation in Eq. Q is 
satisfied depends on the accuracy achieved in the computation of M^^/^. The second, and 
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equally important, requirement is that the method should adapt itself easily to matrices 
with widely different condition numbers — 

^min 



where Amin and Amax are, respectively, the minimum and the maximum eigenvalues of M. 
Adaptability is needed because in QCD applications the eigenvalue spectrum of M can 
fluctuate over many orders of magnitude from one configuration to the next. Since the 
condition number on a configuration is a priori unknown, a method with low adaptability will 
end up either being inefficient or inaccurate or even both. Algorithms, which are adaptable in 
principle, may require tuning of parameters by hand, or they may incorporate a procedure for 
self-tuning. Clearly, self-tuning algorithms are the ones that can best deal with fluctuating 
condition numbers in any real situation. 

In this paper we examine the speed and adaptability of several different algorithms to 
compute the inverse square root, M~^/^, of Hermitean matrices (in our applications the 
eigenvalues of M are non-negative ) acting on a vector. Several algorithms for this have 
been proposed in the literature. 

We do not consider the first algorithm to be proposed, since this requires a matrix inver- 
sion to be performed at each step of an iteration to determine M~^/^ Later algorithms 
are more efficient. These fall into two classes — expansions of l/-\/z in appropriate classes of 
functions (rational functions J] or orthogonal polynomials P]), and iterative methods P|. 
We have analyzed four derived algorithms, namely the Optimized Rational Approximation 
(ORA) W, the Zolotarev Approximation (ZA, which is also a rational expansion) 0,0], the 
Chebychev Approximation (CA, a polynomial expansion) jl^ and the Conjugate Gradient 
Approximation (CCA, an iterative method) ^|. We find that an expansion in Chebychev 
polynomials is the fastest when k is moderately large, but it suffers from various instabili- 
ties including a lack of adaptability. Rational expansions cure many of the instabilities of 
polynomial expansions; indeed the ZA is the fastest and is adaptable but not self-tuning. 
An iterative method is the only fully self-tuned algorithm, and it turns out to be reasonable 
also from the point of view of speed. 
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C. Algorithmic costs and adaptability 



We make two different estimates of tlie cost of each algorithm. The complexity, C, counts 
the number of arithmetic operations required to achieve the solution to the problem and is a 
measure of speed independent of the specific machine on which the algorithm is implemented. 
The spatial complexity, S, is the memory requirement for the problem. While timing runs 
on particular machines on chosen test configurations are instructive, the scaling of speed for 
each algorithm with physical and algorithmic parameters is provided by our estimates of C. 

Our estimate of the adaptability. A, of each algorithm is the following. If the scalar 
algorithm for is tuned to have maximum relative error e in the range [-Zmin? -^max], then 

we find the smallest range [z'^^^, -z^ax] where the error is at most lOe. Note that the second 
interval cannot be smaller than the first. In terms of these quantities we define 

_ ^QS(^max/^min) _ ^ ^g^ 
log ( ^max / ^min ) 

The least adaptable algorithms have small values of ^ ( ^ > ). ^ is a measure of the 
relative accuracy achieved in a fixed CPU time for the same algorithm running on two 
different configurations with condition numbers k = -Zmax/^min and k' = -Zmax/^min- 
conjunction with estimates of C, it also contains information about the scaling of CPU time 
required to achieve the same accuracy on the two configurations. 

D. Numerical tests 

Our numerical tests were performed with three typical SU{3) gauge configurations on a 
4 X 12'^ lattice at (3 = 5.80 (i.e., T = 1.25Tc). The configuration A had eigenvalues of M 
in the range [0.032, 32] so that n = 10^. The configuration B had eigenvalues in the range 
[7.2 X 10^^,32], giving k, = 4.4 x 10^. Finally, configuration C had eigenvalues in the range 
[8.9 X 10~^, 32] and hence k = 3.6 x 10^. Configuration A is one of the easiest configuration 
we found in our simulations, and there were several configurations with k of order 10^-10^ 
in the data set we worked with in If there is a single scale in the level spacing of 

the eigenvalues of M, then we expect k, = 0{V) ^ 7 x 10^ on our test configurations. 
We conclude that A is indeed a little easier than the generic configuration and B and C 
are successively harder. The CPU times we quote in our tables are obtained on a Fujitsu 
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VPP5000, which is a vector computer. Our computations ran on this machine at a speed of 
around 4.1 Gigaflops. 

In sections IlllHVll we describe and analyze the four algorithms and also present data on 
precision and time measurements on these three test configurations. In this work we have 
not investigated the performance of the algorithms with deflation (explicit subtraction) of 
some eigenvectors. However, we do remark on the precision required for deflation and the 
propagation of errors due to such a subtraction. Section IVIII contains a comparison of the 
numerical results and our conclusions. 

II. ERRORS 

In general, every numerical method to compute M~^/^ constructs an operator L which 
applied to a vector $ gives the vector 

X = L[$] with L[$] = M-^/^$ + (10) 

where E is the error in the approximation, L, to the matrix M~^/^. Typically, these operators 
L and E are not matrices because the algorithms can introduce a dependence on $ which 
is not linear. The error E[^] on the computation of M~^/^ leads to the violation of the 
properties in Eqs. (@]|7j). In our numerical tests we have investigated five measures of the 
accuracy of the algorithms through norms of the following operators — 

Zy2 = ML'-l, Zgw = D-f, + -f,D-D^5D, Zn = DD^ -D^D, 
Zh = D^- -f5D^5, Zcc = D + D^- DW, (11) 

where D = 1 + D^L and D"^ = 1 + LD]^. Each of these operators is zero when £'[$] = 0. 
With Gaussian random vectors $, we have measured the deviations from this exact value 
through 

e, = \ZMI\^\ and e[ = Z,<^ (12) 

Here, and later, we use the notation |f | = \/vhj for any complex vector v. Note that ej are 
real and non-negative whereas e[ are complex in general. 
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A. Polynomial approximation 



The polynomial approximation consists of writing 

No 

L = Y.hiM\ (13) 

i=l 

where hi are constants. It is clear that both L and E are matrices in this case and 

[L, M] = 0. (14) 

Since in numerical implementations of -D^, 75-Hermiticity is accurate to machine precision, 
i.e., K-Di; — 75-0^75)$! = 0, one has the following relation: 

75L'^M"75 = M^Dl (15) 

Its direct consequence is 

Zh = LDl - j,D^L-f, = 0. (16) 
Using Eqs. ()14HT3j) . one can easily obtain the following relations between the various Z's : 

Zee = —Zi/2, 
Zgw = IbZcc, 
Zn = Zee - IbZcclb- (17) 

As a consequence, 

1^1/2! I^ccn 

£1/2 = CcC = ^GW: 



= e« = 0. (18) 



Expanding Z1/2 = ML"^ — 1 = Zl^^ powers of E and retaining only the leading order 
terms, we find that Zl,2Zi/2 = AE'^M and Z1/2 = 2EM^^'^. Defining the averages of and 



e'i over an ensemble of $ as e? = Tr ZjZi and = Tr Zj, one obtains, 

e?^ = eL = 4^ = 4/ dXp{X)Xe\X), 

i[;; = -i^ = 2 J dXp{x)Vxe{x), 

7^ = 2 J dXAp{X)VXe{X), (19) 



where p{X) is the density of eigenvalues of M, e(A) is the error in the approximation and 
Ap(A) = p+(A) — P-(A), the difference between the spectral densities (of M) in the chiral 
positive and negative sectors. Note that V^e(A) is the relative error in the determination 
of the inverse square root, and all the integrals depend only on this relative error. Since 
there is no reason for p(0) to vanish, it is clear that the error in the expansion must remain 
under control even as A — 0. Clearly, this is impossible to arrange in polynomial expansions 
for 1 / -\/A. However, a finite sample of gauge configurations does not need full control over 
e(0), but only for e(A<), where A< is the smallest eigenvalue encountered in the sample. 
To achieve this while optimizing CPU costs on configurations where all the eigenvalues are 
much larger requires the algorithm to be adaptive. 

It was assumed above that no deflation has been performed, or that deflation has been 
performed with no arithmetic errors. We comment on the effects of deflation in Section IVlIl 



B. Rational approximation 

In case of a rational function approximation to M~^/^, one writes the operator L as 

No 

L = V -P— (20) 

E here depends on the order Nq and the accuracy of the inversion of (M + di). If the 
inversion can be achieved with infinite precision, then L is a matrix again which commutes 
with M and the analysis of the previous subsection applies in full. If, on the other hand, 
the error due to the inversion dominates, then for many algorithms, such as the Conjugate 
Gradient, L depends explicitly on the vector $ in a complicated way and it is not a matrix. 
One has to compute the different errors explicitly and study their behavior as in iterative 
methods. Thus the behavior of errors from rational approximation case interpolates between 
that of the polynomial approximation and an iterative method according to the relation 
between the order and the precision of the inversion. 



III. FIXED ORDER: CHEBYCHEV APPROXIMATION 

The first use of the polynomial approximation utilized Legendre polynomials jsl. Later 
the same group proposed a more robust version using the Chebychev approximation (CA) 
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[10|. As is well known, when expanding any function, f{z) in a fixed range Zmin < -2 < -Zmax, 
to a given order No through orthogonal polynomials, the use of Chebychev polynomials 
minimizes the maximum error on the function to be approximated. 




FIG. 1: The panel on the left shows the order of Chebychev expansion, Nq required to reach 
an accuracy of 10^'^ (boxes), 10"'* (circles) and 10"® (pentagons) in the range [;Uinini32] under 
variation of /imin (the lines are proportional to log(l/e)/y^^j^). The panel on the right shows the 
error for a Chebychev expansion in the range [0.01,32] with Nq = 400. 

For the function l/v^5 fhe coefficients of the Chebychev expansion for z^^^ < z < z^^-^ 
to order are 

C7, = ^V cos((fc-l)(j-|)7r/iVo) 

^ J^i [^min + ^max + (^min - ^max) COs((j - \)tI / Nq)]^/"^ ' 

Applying this approximation to a matrix M corresponds to finding L$ (see Eq. by the 
expansion — 

1 No 

L$ = -ci'l'^") + £ Cfe$(*^-i), (22) 

2 k=2 

where the successive vectors can be found by the iteration = $ and 

= ^ [2M<I>(""1) - (^^ax + ^mm)$^""^^] " (23) 

•2^max ^min 

for n>2. For n = \ the term <|)(~^) is dropped from the recursion. 

There are various sources of error, which we now analyze in succession. For the scalar 
version of the algorithm (or for each eigenvector of M separately), with fixed ^^min, 2;max 
and parameter Nq-, there is an error in the approximation of which we call e{z). For 

computations at arbitrary precision with a fixed range, [^min, ^max], ^{z) depends entirely on 



No 


£-1/2 




^GC 


time 


30 


0.273 X 10" 


-1 


0.273 X 10" 


-1 


0.273 X 10" 


-1 


0.3 


65 


0.174 X 10" 


-2 


0.174 X 10" 


-2 


0.174 X 10" 


-2 


0.7 


100 


0.136 X IQ- 


-3 


0.136 X 10" 


-3 


0.136 X 10" 


-3 


1.1 


135 


0.113 X IQ- 


-4 


0.113 X 10" 


-4 


0.113 X 10' 


-4 


1.5 


165 


0.140 X IQ- 


-5 


0.140 X 10" 


-5 


0.140 X 10" 


-5 


1.8 


200 


0.125 X 10" 


-6 


0.125 X 10" 


-6 


0.125 X 10" 


-6 


2.2 


235 


0.113 X 10" 


-7 


0.113 X 10" 


-7 


0.113 X 10" 


-7 


2.5 


270 


0.102 X IQ- 


-8 


0.102 X 10" 


-8 


0.102 X 10' 


-8 


2.9 


300 


0.134 X 10" 


-9 


0.134 X 10" 


-9 


0.134 X 10' 


-9 


3.4 


330 


0.174 X 10~ 


10 


0.174 X 10- 


10 


0.174 X 10- 


10 


3.7 


360 


0.230 X 10" 


11 


0.230 X 10" 


11 


0.230 X 10" 


11 


4.1 


390 


0.391 X 10" 


12 


0.391 X 10" 


12 


0.391 X 10" 


12 


4.4 


420 


0.210 X 10- 


12 


0.209 X IQ- 


12 


0.209 X 10- 


12 


4.8 


450 


0.226 X 10- 


12 


0.226 X IQ- 


12 


0.226 X 10- 


12 


4.9 



TABLE I: Runs with the CA adjusted to the interval [0.032, 32] for varying No on the configuration 
A. The last column gives the CPU seconds used on a Fujitsu VPP5000. 



No 


^1/2 


^GW 


fee 


time 


100 


0.236 X 10" 


-1 


0.236 X 10" 


-1 


0.236 X 10" 


-1 


1.1 


500 


0.294 X 10" 


-2 


0.294 X 10- 


-2 


0.294 X 10" 


-2 


5.4 


1000 


0.485 X 10- 


-3 


0.485 X 10- 


-3 


0.485 X 10" 


-3 


10.8 


1500 


0.108 X 10- 


-3 


0.108 X 10- 


-3 


0.108 X 10" 


-3 


16.9 


2000 


0.292 X 10" 


-4 


0.292 X 10" 


-4 


0.292 X 10- 


-4 


21.8 


3000 


0.254 X 10" 


-5 


0.254 X 10- 


-5 


0.254 X 10" 


-5 


32.7 


4000 


0.267 X 10- 


-6 


0.267 X 10- 


-6 


0.267 X 10- 


-6 


43.7 



TABLE II: Runs with the CA adjusted to the interval [7.2 x 10"^, 32] on the configuration B. The 
last column shows the CPU seconds used on a Fujitsu VPP5000. 
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No 


£l/2 




^CC 


time 


100 


0.253 X 10" 


-1 


0.253 X 10' 


-1 


0.253 X 10" 


-1 


1.1 


500 


0.788 X 10- 


-2 


0.788 X 10- 


-2 


0.788 X 10" 


-2 


5.5 


1000 


0.471 X 10" 


-2 


0.471 X 10" 


-2 


0.471 X 10" 


-2 


10.9 


1500 


0.395 X 10" 


-2 


0.395 X 10" 


-2 


0.395 X 10" 


-2 


16.2 


3000 


0.460 X 10" 


-2 


0.460 X 10" 


-2 


0.460 X 10" 


-2 


32.4 



TABLE III: Runs with the CA adjusted to the interval [8.9 x 10 ^,32] on the configuration C. The 
last column shows the CPU seconds used on a Fujitsu VPP5000. 

No- To keep the absolute relative error bounded, |e(z)|A/i < e, as z^^-^ changes, we must 
tune 

as shown in Figure [TJ The last expression follows if we choose Zmin = Amin and Zmax = Amax- 
However, as shown in the right panel of the figure, tuning No in this manner causes the 
relative error outside the range to blow up. The CA has no tolerance to violations of the 
requirement on the range. This is easy to understand. In any polynomial approximation, 
with decreasing 2;min larger values of No are required for keeping the error fixed within the 
interval [^min, ^max]- However, outside this interval, the error then increases as 

/ \ /2z ^^jjiin 2^max \ p 

e[z) OC ( I , for Z < Zrain OI Z > Z^ai,^, (25) 

\ ^max ■2^min / 

and hence the error increases faster as z^^^ decreases. Solving this for z, given some fixed 
value of e{z)/e^ we can easily see that for large k and fixed precision e. 



A OC exp(— av^) 



(26) 



where a is some number. 

The effect of finite arithmetic precision can also be analyzed easily since the iteration in 
Eq. is linear. Any arithmetic error, 5^'^\ in ^^^^ of the order of the machine precision 
remains in control whenever all eigenvalues, A of M satisfy ^min < A < ^max- If any eigenvalue 
of M lies outside this range, then the iteration magnifies the error geometrically — 



^(m+n) 



'2A, 



(27) 
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This is the vector version of the low adaptabihty of this algorithm. If estimates of Amin and 
Amax for M are available, then, in view of this instability, it is best to choose Zmin < Amin 

and ^max ^ Ajjiax- 

The complexity of this algorithm is clearly dominated by the time required to operate 
upon a vector by the matrix M in the iterations in Eq. (j23p . For the Wilson-Dirac matrix 
this time is of order V . Neglecting the time taken for scalar operations in the remainder of 
the algorithm, and also the order V time for vector additions, in comparison with this, we 
have — 

CcA ^ wNoV ~ w'V log Q v^, (28) 

where wV is the complexity of operating upon a vector by M and w, w' are constants. 
Apart from the gauge configuration, in QCD applications the storage required is for the 
three vectors needed for the iteration in Eq. (j23p . The space complexity is therefore 

ScA = 8N,{N, + 3)V, (29) 

(for Nc colors) neglecting storage for scalars. 

With a fixed No, the precision of the algorithm deteriorates sharply when one or more of 
the eigenvalues of M lie outside the interval [zmin, ^max], as shown in Figure [T] and by the low 
value of A in Eq. (j^ . This is often sought to be corrected for by deflating, i.e., explicitly 
dealing with the eigenspace of the lowest eigenmodes, and applying the algorithm to the 
orthogonal space. This would keep the accuracy constant as the condition number changes. 
If Nd vectors need to be deflated, then the contribution to C clearly increases as {N^Vy, 
since each vector has to be orthogonalized with respect to every other. Also, S increases as 
NdV due to the necessity of storing the vectors. In order to achieve a target precision, e, on 
all gauge configurations, we are forced to deflate all vectors with A < z^i^. Working with a 
fixed No forces us to do unnecessarily large amount of work on most configurations, while 
still failing on a small set of configurations. As a result, Nd has to be chosen appropriately 
for each configuration. With deflation then we have 

CcA = w'V\og(^^^,/J^ + w"V^Nn^), 

ScA = 8N,{N, + 3)V + 8N,{No)V, (30) 

where the angular brackets denote averages over gauge configurations, w" is a constant 
independent of V and Nd, and K^jf is the condition number of the matrix after deflating 

12 



No vectors. With careful programming we can arrange to make w" < w', although they 
cannot differ by orders of magnitude {w" can depend on e and Amin)- 

We have not investigated the expectation values of N^. However, when we change the 
volume at fixed physics, we expect that {N^,) oc Vp, where p is the average density of 
eigenvalues of M near Amin- Since deflation is designed to hold Kg// fixed, this means that 
for large enough V the complexity Cqa oc V^. On the other hand, k should generically 
grow linearly in V. Hence, on sufficiently large volumes, without defiation we would have 
CcA oc V^^"^. For best actual performance, one would have to tune the playoff between these 
two limits. 

A further complication arises in CA, and indeed, in any method which utilizes a poly- 
nomial expansion. For any fixed finite precision, the defiation of is inaccurate since 
each component of the defiated vector is in error at least in the least significant bit. Since 
the CA iteration of Eq. (j221) is not stable on the defiated eigenspace, this error blows up 
geometrically as in Eq. ()27|1 . Consequently, more and more bits are corrupted, as the it- 
eration proceeds, and the process may eventually render the whole computation unusable. 
Note that this problem becomes more acute with decreasing k, even if several eigenvectors 
are defiated. To prevent the error from swamping the result in this fashion, one has to 
reorthogonalize repeatedly (this process itself is not free of complications, see 12 1). 
This involves finding Nd dot products and subtracting vectors. While this is crucial in 
maintaining the accuracy of the result, it does not change the complexity, and we still have 
the results in Eq. (fHUjl . 

A different approach, and the one we have adopted for our tests, is to start the algorithm 
by making an estimate of Amin and Amax and then to select Nq accordingly. This obviates 
any need for deflation, and controlling the rounding errors in such a method. The algorithm 
is well-behaved, both with respect to precision and propagation of rounding errors, whenever 
Zmin < Amin < Amax < ^max- Howcvcr, in this casc the Complexity rises as y^- 

As shown in Table HJ the algorithm performs well on conflguration A. Nq needed to 
achieve a given value of ei/a is seen to rise logarithmically, as argued above. Consequently, 
ei/2 can be made very small and the required chiral properties obtained at any desired 
precision. Note also that the equalities ()18|) are exactly satisfled, as expected, and remain 
so for conflgurations B and C too, as seen from the Tables ITTl and UTTl respectivelv. However, 
the algorithm was found to be extremely slow for these conflgurations and saturated at 
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No ~ 1500 with ei/2 = 4 x 10 ^ on the configuration C, leading to the same precision for 
both the GW relation and the unit circle property of D. 



IV. FIXED ORDER: OPTIMIZED RATIONAL APPROXIMATION 




io-8| . . . . . 

0.0001 0.001 0.01 0.1 1 10 100 

X 



FIG. 2: The relative error, |e(z)|^/i, in the ORA with Nq = 14 fitted to the range [0.01,1] for 
computation in and outside the range. 



The first algorithm to compute M through a rational expansion was a polar for- 
mula introduced by Neuberger An improved version ^7|, called the optimized rational 
approximation, uses coefficients obtained numerically to give the approximation 



Ck 



No 



(31) 



where a is an arbitrary scale whose choice we describe later, and the values of and for 
a given Nq are obtained by optimizing the fit on some fixed interval [zmm, -^max] using the 
Remez algorithm (see j?! for details). The inversion of {aM + d^) is made by a multimass 
conjugate gradient stopped according to a value e^G for the residual. This version is used in 
particular in jl^, and jisj ]. 
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a 




lO^ecw 


lO^ei/a 


lO^ecvK 


lO^ei/a 


lO^ecvK 


1.00 


11.2 


11.3 


11.2 


11.2 


11.2 


11.2 


0.50 


1.74 


2.01 


1.44 


1.44 


1.44 


1.44 


0.20 


0.987 


1.40 


0.0780 


0.116 


0.0257 


0.0273 


0.15 


0.986 


1.40 


0.0744 


0.114 


0.0112 


0.0143 


0.12 


0.986 


1.40 


0.0744 


0.114 


0.0110 


0.0142 


0.10 


0.986 


1.40 


0.0744 


0.114 


0.0111 


0.0142 


0.08 


0.986 


1.40 


0.0744 


0.114 


0.0111 


0.0143 


0.05 


0.986 


1.40 


0.0744 


0.114 


0.0111 


0.0142 



TABLE IV: Tuning a in ORA for a fixed configuration with three different values of ecc- Since 
this fixes the upper part of the range of eigenvalues, we expect little change from one configuration 
to another, and a = 0.1 is a global choice. 





^CG 


£l/2 




^CC 


time 


10" 


-1 


25 


0.179 X 10" 


-1 


0.257 X 10" 


-1 


0.539 X 10" 


-1 


0.5 


10" 


-2 


63 


0.986 X 10" 


-3 


0.140 X 10" 


-2 


0.614 X 10" 


-2 


1.1 


10" 


-3 


99 


0.744 X 10" 


-4 


0.114 X 10" 


-3 


0.507 X 10" 


-3 


1.6 


10" 


-4 


134 


0.111 X 10" 


-4 


0.142 X 10" 


-4 


0.397 X 10" 


-4 


2.1 


10" 


-5 


166 


0.916 X 10" 


-5 


0.919 X 10" 


-5 


0.966 X 10" 


-5 


2.6 


10" 


-6 


198 


0.914 X 10" 


-5 


0.914 X 10" 


-5 


0.914 X 10" 


-5 


3.1 



TABLE V: Runs with the ORA for Nq = 14 optimized in the interval [0.01, 1] and a = 0.1 on 
configuration A. The last column gives the CPU seconds used on a Fujitsu VPP5000. 

Taking the values of Ck and dk for Nq = 14, z^i^ = 0.01 and Zmax = 1 from W, we show in 
Figure 121 the relative accuracy, |e(2;)|A/i for the expansion both inside and outside the fitted 
range. It is clear from the figure that the algorithm rapidly degrades outside the chosen 
range. A numerical computation shows that 

^ ^ 1, (32) 

so that it has much higher adaptability than the CA. Nevertheless, it makes large errors on 
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^CG 


^1/2 




^CC 


time 




-1 


71 


0.148 X 10" 


-1 


0.243 X 10- 


-1 


0.542 X 10- 


-1 


1.2 


10- 


-2 


332 


0.767 X 10- 


-4 


0.113 X 10- 


-3 


0.799 X 10- 


-2 


5.0 


10" 


-3 


363 


0.220 X 10- 


-4 


0.229 X 10- 


-4 


0.499 X 10- 


-2 


5.5 


10" 


-4 


395 


0.208 X 10" 


~4 


0.208 X 10- 


-4 


0.160 X 10- 


^2 


6.0 


10- 


-5 


426 


0.208 X 10- 


-4 


0.208 X 10- 


-4 


0.209 X 10- 


-4 


6.4 



TABLE VI: Runs with the ORA for Nq = 14 optimized in the interval [0.01,1] and a = 0.1 on 
configuration B. The last column shows the CPU seconds used on a Fujitsu VPP5000. 





NcG 


£l/2 






time 


10-1 


317 


0.161 X 10-1 


0.246 X 10-1 


0.547 X 10-1 


4.9 


10-2 


781 


0.333 X 10-2 


0.333 X 10-2 


0.111 X 10-1 


11.9 


10-3 


815 


0.333 X 10-2 


0.333 X 10-2 


0.771 X 10-2 


12.5 



TABLE VIL Runs with the ORA for Nq = 14 optimized in the interval [0.01, 1] and a = 0.1 on 
configuration C. The last column shows the CPU seconds used on a Fujitsu VPP5000. 

the eigenvalues of M which are greater than 2;max- It is easy to see that many eigenvalues 
are greater than unity, and a scaling factor a is therefore needed to bring these into range. 
The tuning of a is shown in Table IIVI For the conjugate gradient inversion of each term, 
when ecG is large, it determines the accuracy of the solution. Hence e^G must be kept small 
enough so that the accuracy is e. 

In order to specify the scaling of the CPU time in each algorithm with various parameters 
of the problem, we count the complexity of the method. For this algorithm, C is proportional 
to the number of steps of the Conjugate Gradient inversion, Ncg- It can be proved that Ncg 
grows no faster than ^/K,. However, from the observed convergence rate of the CG iterations 
(shown in Section |VH|) . we see that Ncg oc log(l/ecG) log z^- This expression can be used 
when ecG is tuned to be smaller than the error shown in Figure |21 For each step of the 
multimass CG inversion, the complexity is dominated by the time required to operate upon 
a vector by the matrix M. For the Wilson-Dirac matrix this is of order V. Neglecting the 
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time taken for scalar operations and also the order V time for vector additions, we have — 

Cora ^ wNqgV ^ w'V log (—] log k, (33) 



where wV is the complexity of operating upon a vector by M, and w' is a constant. The 
dependence of Nqg on V is very weak for realistic Ncg and is neglected. The memory 
requirement is essentially for the storage of the gauge configuration and for 2 + 2No vectors 
required to build up the approximation. The space complexity is therefore 

SoRA = 8N,{N, + 2 + 2No)V, (34) 

(for Nc colors) neglecting storage for scalars. 

With a fixed value of Nq, the precision of the algorithm deteriorates when the eigenvalues 
of M lie outside the range [^min, z^ax], as shown in Figure El For D^j, the highest eigenvalue 
remains in the vicinity of 32 for most configurations, whereas the lowest eigenvalue may 
fluctuate by several orders of magnitude. The large eigenvalues are brought into range by 
tuning a < 1 as shown already. However, this drives the lowest eigenvalues further out of the 
range, thus degrading performance. As a result, it is necessary to deflate, i.e., explicitly deal 
with the eigenspace of the lowest eigenmodes, and apply ORA on the orthogonal space in 
order to keep a constant accuracy as the condition number changes. As before, this changes 
the complexity to 

Cora = w'Vlog(—)logKeff + w"{ND')V\ 

SoRA = 8N,{N, + 2 + 2No)V + 8N,{Nd)V. (35) 

where the angular brackets denote averaging over the sample of configurations used. From 
available data on the growth of No required to keep the relative error fixed with growing 
K 0] it seems that it is better to increase Nq rather than to keep Heff fixed by increasing 
{Nd"^) with increasing volume at fixed physics. 

In Tables IV| IVII and IVIII we show the results of our numerical tests of the ORA using the 



set of Cfc and dk for Nq = 14 from |'/| for the three configurations already described. One 
sees that with higher precision of inversion, ecc, the relations (fT^ indeed get satisfied well. 
The gradual deterioration of the performance of ORA with fixed No in going to larger k is 
clear from the tables. This indicates that the performance of ORA may improve if a degree 
of adaptability can be built into the algorithm by, for example, allowing for changes in No 
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because with Nq = 14, we cannot obtain better precision than 0.00001, 0.00002 and 0.003 
for the GW relation for configurations A, B and C respectively. 



V. FIXED ORDER: ZOLOTAREV APPROXIMATION 




1e-14 I — • ^ — ■ ^ — • ^ — ■ ^ — ■ ^ — • ^ — ■ ^ — • ^ 

1e-05 0.0001 0.001 0.01 0.1 1 10 100 1000 



FIG. 3: The relative error, |e(z)|-yz for ZA in the range [10 ^,32] for computation in and outside 
the range for various Nq- 

The Zolotarev algorithm was explored in and used in j^. It is a rational expansion 
defined by 

(36) 



in the range [zmin, ^max] (the smallest and largest eigenvalues of M must satisfy the conditions 
^min < Amin < A^ax < z^a.y) i and the cxpausiou coefficients are 



ai = C21-1 ana ui = uq- 

The q's are 



di = C21-1 and k = do -. (37) 



ci = , (38) 

Cn'^{lK/2No] Jl- ^min/^max) 
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No 


^CG 


NcG 


£l/2 


^GW 


^CC 


time 


5 


10' 


-1 


22 


0.207 X 10" 


-1 


0.281 X 10" 


-1 


0.534 X 10" 


-1 


0.3 


6 


10" 


-2 


56 


0.144 X 10" 


-2 


0.216 X 10" 


-2 


0.612 X 10" 


-2 


0.7 


7 


10' 


-3 


91 


0.114 X 10- 


-3 


0.160 X 10" 


-3 


0.668 X 10" 


-3 


1.2 


8 


10" 


-4 


126 


0.984 X 10" 


-5 


0.150 X 10" 


-4 


0.526 X 10" 


-4 


1.7 


9 


10" 


-5 


159 


0.849 X 10" 


-6 


0.123 X 10" 


-5 


0.467 X 10" 


-5 


2.2 


10 


10" 


-6 


191 


0.808 X 10" 


-7 


0.117 X 10" 


-6 


0.365 X 10" 


-6 


2.6 


11 


10" 


~7 


223 


0.808 X 10" 


-8 


0.119 X 10" 


-7 


0.343 X 10" 


-7 


3.1 


13 


10" 


-8 


259 


0.616 X 10" 


-9 


0.924 X 10" 


-9 


0.288 X 10" 


-8 


3.8 


14 


10" 


-9 


295 


0.566 X 10- 


10 


0.798 X 10- 


10 


0.289 X 10" 


-9 


4.4 


15 


10" 


10 


326 


0.557 X 10- 


11 


0.793 X 10- 


11 


0.296 X IQ- 


10 


5.0 


16 


10" 


11 


357 


0.553 X 10" 


12 


0.723 X 10" 


12 


0.225 X 10" 


11 


5.5 



TABLE VIII: Runs with the ZA in the interval [0.035, 32] for configuration A. The last column 
gives the CPU seconds used on a Fujitsu VPP5000. 



No 




NcG 


^1/2 


^GW 




time 


7 


10" 


-1 


71 


0.148 X 10" 


-1 


0.243 X 10' 


-1 


0.540 X 10' 


-1 


0.9 


10 


10" 


-2 


331 


0.876 X 10- 


-4 


0.130 X 10' 


-3 


0.798 X 10' 


-2 


4.5 


12 


10" 


-3 


362 


0.833 X 10" 


-5 


0.110 X 10" 


-4 


0.499 X 10" 


-2 


5.2 


14 


10" 


-1 


394 


0.814 X 10" 


-6 


0.101 X 10" 


-5 


0.160 X 10" 


-2 


5.9 


16 


10" 


-5 


426 


0.741 X 10" 


-7 


0.902 X 10" 


-7 


0.366 X 10' 


-6 


6.6 


18 


10" 


-6 


457 


0.744 X 10" 


-8 


0.976 X 10' 


-8 


0.297 X 10' 


-7 


7.3 


20 


10" 


-7 


488 


0.692 X 10- 


-9 


0.103 X 10' 


-8 


0.313 X 10' 


-8 


8.1 


22 


10" 


-8 


516 


0.693 X 10" 


10 


0.963 X 10" 


10 


0.278 X 10" 


-9 


8.9 


24 


10" 


-9 


544 


0.644 X 10" 


11 


0.903 X 10" 


11 


0.253 X 10- 


10 


9.7 


26 


10" 


10 


572 


0.715 X 10- 


12 


0.906 X 10- 


12 


0.244 X 10- 


11 


10.4 



TABLE IX: Runs with the ZA in the interval [7.2 x 10"^, 32] for configuration B. The last column 
indicates the CPU seconds used on a Fujitsu VPP5000. 



19 



No 


^CG 


NcG 




(-GW 




time 


10 


10" 


-1 


325 


0.163 X 10" 


-1 


0.247 X 10" 


-1 


0.545 X 10" 


-1 


4.5 


16 


10" 


'2 


871 


0.242 X 10" 


^4 


0.125 X 10" 


-2 


0.111 X 10" 


-1 


13.4 


20 


10" 


-3 


899 


0.121 X 10" 


-5 


0.107 X 10" 


-5 


0.771 X 10" 


-2 


14.9 


23 


10" 


-4 


933 


0.119 X 10" 


-6 


0.105 X 10" 


-6 


0.325 X 10" 


-2 


16.3 


26 


10" 


-5 


968 


0.126 X 10" 


-7 


0.103 X 10" 


-7 


0.325 X 10" 


-2 


17.7 


30 


10- 


-6 


997 


0.462 X 10" 


-8 


0.482 X 10" 


-9 


0.749 X 10" 


-9 


19.4 


36 


10" 


'7 


1024 


0.521 X 10" 


~8 


0.150 X 10^ 


10 


0.450 X 10" 


10 


21.7 



TABLE X: Runs with the ZA in the interval [8.9 x 10 ^, 32] for configuration C. The last column 
indicates the CPU seconds used on a Fujitsu VPP5000. 

where the values of the Jacobi elliptic functions, sn('u, fc) = sin 77 and cn('u, fc) = cos 77, are 
defined by the integral 



(39) 



mm y 



^(1 -t2)(l-pt2) 

The constant in Eq. ()38|) . K = u{l), is the complete elliptic integral. When sn is near or 
1, high precision in the expressions of the coefficients of the corresponding c's is essential. 
The constant do in Eq. ()37|) can be expressed in term of elliptic theta function ^16^, or 
equivalently, fixed by the condition 

y(^]=2 (40) 

As in the ORA, the multimass CG which is used to invert the terms in Eq. (|36|) should 
have a stopping criterion, ecG- One advantage of ZA over ORA is that the quantities di in Eq. 
fl36|) are larger than those in Eq. ()31|) . As a result, a multimass CG inverter can evaluate 
this approximation somewhat faster. Another advantage, emphasized in Q is that Nq 
required for a certain accuracy is smaller for ZA than for ORA. It was found that a relative 
accuracy of better than 1 part in 10^ is obtained for the interval [0.01, 1] with Nq ~ 6 in 
ZA, as compared to 14 in ORA. As shown in Figure |21 the relative error for ZA in the range 
[10~^, 32] does not require significantly higher Nq for similar control over error. However, 
the low adaptability, A ^ 0.01, means that the coefficients should be computed over a range 
appropriate to the condition number of the matrix. The main effect of increasing the range 
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for a fixed Nq is to change the coefficients hi and di in such a way that the logarithmic range 
of di increases. We found that a factor 100 decrease in Zmin (for fixed Zmax = 32) led to a 
factor 20-30 decrease in the ratio of the minimum and maximum values of di. 

The complexity and spatial complexity of ZA are very similar to that of ORA. The 
complexity is dominated by the matrix-vector multiplication in the CG inversions, and the 
memory requirement is dominated by the vectors in the multimass CG. Hence 

CzA ^ wVlog f— ) logK, 

SzA = 8N,{N, + 2 + 2No)V, (41) 

where w' is some constant, Nc is the number of colors and V is the lattice volume. Since 
the effect of deflation is also similar to that in ORA, we do not repeat that discussion here. 

The performance of the Zolotarev algorithm in numerical tests is summarized in Tables 
I VII II HXl and |X| One needs to tune two algorithmic parameters, Nq and ecc for better 
efficiency. For a given value of ecc increase Nq until a saturation in the value of error is 
evident. For Nq ~ 6 — 8, the performance is similar to that of the ORA. The improvement 
with increasing Nq as k increases further indicates that the ZA and ORA should both 
improve if Nq is allowed to chan^algorithmically with configuration. Such a method can 
be constructed from the results of jsf, when Nq is increased until the maximum relative error 
(see Figure El) attains a fraction (<l/2) of the desired accuracy. This can be implemented 
at the initialization step from the knowledge of the minimum and maximum of the relative 
error (defined in LHS of Eg. ()4()|1 or from the elliptic theta functions). 



VI. ADAPTIVE ALGORITHM: CONJUGATE GRADIENT APPROXIMATION 



The first adaptive method used to compute M~^/^ was based on Lanczos algorithm 
In the original suggestion, the number of Lanczos steps to be taken in order to reach a given 
precision was investigated in terms of the variation of the eigenvalues of M with the number 
of Lanczos steps. A stopping criterion a la Conjugate Gradient was proposed but its relation 
to the precision was not direct. A related adaptive method based on the Conjugate Gradient 



algorithm was used in ll|. Here the stopping criterion is put on the residual vector in the 
inversion of M. This enables a direct control over the precision. 

The CGA starts with an iteration which is almost the same as the usual CG algorithm 



21 



^CG 


NcG 


ei/2 


^GW 




time 




-1 


22 


0.230 X 10" 


-1 


0.410 X 10" 


-1 


0.860 X 10" 


-1 


0.5 


10" 


-2 


55 


0.178 X 10" 


-2 


0.325 X 10" 


-2 


0.124 X 10" 


-1 


1.2 


10" 


-3 


90 


0.145 X 10" 


-3 


0.279 X 10" 


-3 


0.195 X 10" 


-2 


2.1 


10" 


-4 


125 


0.121 X IQ- 


-4 


0.266 X 10" 


-4 


0.148 X 10" 


-3 


2.9 


10" 


-5 


158 


0.103 X IQ- 


-5 


0.219 X 10" 


-5 


0.134 X 10" 


-4 


3.6 


10" 


-6 


190 


0.925 X 10" 


-7 


0.182 X 10" 


-6 


0.926 X 10" 


-6 


4.2 


10" 


-7 


222 


0.899 X 10" 


-8 


0.172 X 10" 


-7 


0.839 X 10" 


-7 


5.0 


10" 


-8 


257 


0.859 X IQ- 


-9 


0.175 X 10" 


-8 


0.827 X 10" 


-8 


5.7 


10" 


-9 


293 


0.784 X 10- 


10 


0.158 X 10" 


-9 


0.733 X 10- 


-9 


6.6 


IQ- 


10 


325 


0.708 X 10" 


11 


0.142 X 10- 


10 


0.882 X 10- 


10 


7.3 


10" 


11 


358 


0.714 X 10" 


12 


0.149 X 10" 


11 


0.600 X 10" 


11 


8.2 



TABLE XI: Runs with the CGA on configuration A (k = 10^). The last column indicates the CPU 
seconds taken on a Fujitsu VPP5000. 



(■CG 


NcG 


^1/2 


^GW 


^cc 


time 


10" 


-1 


23 


0.236 X 10" 


-1 


0.394 X 10" 


-1 


0.865 X 10- 


-1 


0.5 


10" 


-2 


212 


0.198 X 10" 


-2 


0.347 X 10" 


-2 


0.136 X 10" 


-1 


4.7 


10" 


-3 


335 


0.617 X 10" 


-4 


0.114 X 10" 


-3 


0.544 X 10" 


-2 


7.4 


10" 


-4 


365 


0.633 X 10" 


-5 


0.112 X 10" 


-4 


0.460 X 10- 


-2 


8.1 


10" 


-5 


397 


0.651 X 10" 


-6 


0.124 X 10" 


-5 


0.160 X 10- 


-2 


8.8 


10" 


-6 


428 


0.625 X 10" 


-7 


0.117 X 10" 


-6 


0.544 X 10- 


-6 


9.5 


10" 


-7 


459 


0.629 X 10" 


-8 


0.116 X 10" 


-7 


0.461 X 10" 


-7 


10.2 


10" 


-8 


489 


0.616 X 10" 


-9 


0.110 X 10" 


-8 


0.506 X 10" 


-8 


10.9 


10" 


-9 


517 


0.626 X 10- 


10 


0.109 X 10" 


-9 


0.442 X 10- 


-9 


11.6 


10" 


10 


545 


0.596 X 10- 


11 


0.995 X 10- 


11 


0.401 X 10- 


10 


12.2 


10" 


11 


573 


0.124 X 10- 


11 


0.111 X 10- 


11 


0.396 X 10- 


11 


12.9 



TABLE XII: Runs with the CGA on configuration B (k = 4.4 x 10^). The last column indicates 
the CPU seconds taken on a Fujitsu VPP5000. 
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^CG 


NcG 


Cl/2 


^GW 


^CC 


time 


10" 


-1 


62 


0.231 X 


10" 


^1 


0.381 X 10" 


-1 


0.869 X 10" 


-1 


1.4 


10" 


-2 


642 


0.393 X 


10" 


-2 


0.605 X 10" 


-2 


0.153 X 10" 


-1 


14.7 


10" 


-3 


830 


0.310 X 


10" 


-3 


0.125 X 10" 


-2 


0.874 X 10" 


-2 


19.2 


10" 


-4 


863 


0.298 X 


10" 


-4 


0.152 X 10" 


-5 


0.691 X 10" 


-2 


20.2 


10" 


-5 


891 


0.287 X 


10" 


-5 


0.169 X 10" 


-6 


0.421 X 10" 


-2 


20.7 


10" 


-6 


922 


0.304 X 


10- 


-6 


0.193 X 10- 


-7 


0.325 X 10" 


-2 


21.6 


10" 


-7 


957 


0.318 X 


10" 


-7 


0.245 X 10" 


-8 


0.325 X 10" 


-2 


22.4 


10" 


-8 


987 


0.853 X 


10" 


-8 


0.822 X 10" 


-8 


0.231 X 10" 


-8 


23.1 


10" 


-9 


1015 


0.766 X 


10" 


-8 


0.135 X 10" 


-8 


0.555 X 10" 


-8 


23.8 



TABLE XIII: Runs with the CGA on configuration C {k = 3.6 x 10^). The last column indicates 
the CPU time in seconds on a Fujitsu VPP5000. 

for the inversion of M — 

1. Start from ri — pi — ri and Pi — 0, 

2. Iterate as in regular CG, = |rip/(pjMpj), n+i = - a^Mpj, = |ri+i|^/|rip, 
and pi+i = Pi+iPi + n+i. 

3. Stop when |ri+i| < Ecg- 

Note that the the only difference from the usual CG is that the vector which is M~^$ does 
not need to be obtained during the iteration. 

In the orthonormal basis of Qi — ri/\ri\, the matrix M is the composition of the matrix 
Q whose i-th column is Qi, and a symmetric tridiagonal matrix, T, 

M^Q^TQ, where Tu^— + ^, and = (42) 

ai ai-i ai 

where ai and /3j are defined in the iteration above. Then compute the eigenvalues and 
eigenvectors of this truncated tridiagonal matrix T, 

T = UKU^ (43) 
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where A is the diagonal matrix of the eigenvalues and U the matrix of the eigenvectors in 
the basis Q. The CGA solution is 

= g*t/A-i/2f/tg$/|$|. (44) 

The adaptability of the algorithm arises from the fact that we retain only the vectors qi 
which contribute significantly to the inverse of M and we stop the iterations for i = Nqg 
when \rNcG+^\ < '^CG- 

The contribution to M"^/^ of the smallest eigenvalue 1/Amin of M will be only l/V-^min- 
Since the stopping criterion |rj+i| < e^G is meant to compute it is more stringent than 
required. One can be more generous for M~^/^, and use instead the stopping criterion 

kml <eco/\/AF (45) 

where is an upper bound of Amin- Fortunately a reasonable estimate can be obtained 
at each iteration i without large overheads. For any tridiagonal matrix T of order No-, the 
number of eigenvalues greater than a fixed number is the number of positive values of 
where this set of numbers is defined by S-^"* = Tu — jj, and 

rf(^)=T,,-A?-(T,_i,)Vrf(^--i), (46) 

for 2 < j < No Q|. An upper bound for Xq^ can always be fixed by searching for a 
number for which at least one of is non-positive. This can be done by bisection, starting 
from the initial estimate at the first step, Aq^'' = Tu. While this procedure increases the 
complexity by order log Nqg, the new stopping criterion in Eq. ()45|) has two advantages over 
the usual CG stopping criterion — first, Ncg is reduced and, second, the method becomes 
better adaptable since the observed ei/a for a given ecc becomes independent of Xmin- 

Practically, to do the computation without storing the orthonormal basis Q, one makes 
Ncg iterations to get the truncated matrix T, computes the matrix U and the diagonal A 
using standard methods [3], and then repeats the Nqg iterations to compute the solution 
L[$]. The most stringent restriction on the algorithm seems to be that one cannot use any 
pre-conditioning and must always start the iterations from pi = ri = $. This algorithm has 
only one parameter, eca- The algorithm automatically adjusts the number of iterations to 
achieve the specified precision irrespective of the condition number. Thus, no configuration 
dependent tuning of algorithmic parameters is necessary when employing the CGA for QCD 
applications. 
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The complexity of the CGA is 

CcGA ^ 2wNcgV + uoNcG^ ~ 2w'V log (— ) log k (47) 

where is a number independent of V . The Ncg^ term comes from the handling of the 
tridiagonal matrix, and can be neglected since Ncc <^ V . The space complexity is the same 
as that of a standard CG — 

ScGA ^ 8N,{N, + 3)V. (48) 

Since the method is adaptive, no deflation is necessary. However, deflation reduces the 
condition number of the matrix, and hence could improve the complexity by reducing Nqg- 
Nevertheless, for reasons that we have already discussed in connection with CA and ORA, 
deflation is unlikely to improve the performance at fixed physics when taking the limit of 
large V. 

The results of our numerical tests for this algorithm are collected in Tab lesEUxml Note 
that for all three test configurations there is a threshold in e^G above which e^c ^ lOci/a and 
below which e^c is roughly constant. The threshold value of e^G is somewhat larger than Amin 
for the configuration. Similar thresholds are also seen for the ORA and ZA. This behaviour 
possibly refiects the existence of a large unconverged subspace in the CG iterations. 



VII. COMPARING THE ALGORITHMS 

In Figure m we have collected different measures of performance of the four algorithms we 
investigated in this paper, namely, the Optimized Rational Approximation (ORA) |3], the 
Zolotarev Approximation (ZA, which is also a rational expansion) 0,0], the Chebychev Ap- 
proximation (CA, a polynomial expansion) and the Conjugate Gradient Approximation 
(CGA, an iterative method) Further details can be found in Tables HV lXllII 

It is clear that for modest values of the condition number of M, n < 10^, the CA is the 
preferred algorithm. This is clear from the figure, as well as our results for the algorithmic 
complexities in Eqs. ()28|1 . . ()4H1 and ()47j) . However, with increasing condition numbers 
the performance of CA rapidly degrades. This is visible in the figure as well as in our analysis 
of the adaptability in Eq. ()26|) . We have argued earlier that these drawbacks of the CA are 
generic to all polynomial expansions. 
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FIG. 4: Error limits as a function of the CPU time taken on a Fujitsu VPP5000 — (a) ei/2, (b) 
Cgw) (c) ecc and (d) e^r- In each case the dotted hne is for configuration A (k = 10^), the dashed 
hne for configuration B (k = 4.4 x 10^) and the full line for configuration C (k = 3.6 x 10^). 

The ORA, in its present form with fixed Nq, also suffers from a lack of adaptability. In 
principle, this can be alleviated if the order of the approximation can be chosen adaptively. 
We have implemented the ZA, which is another rational approximation, for several different 
choices of order. As can be seen from Tables IVlllI - \K\ and from Figure EJ this improves 
the performance tremendously. For comparable CPU times, corresponding to low order ZA, 
the performance is at least one order better than that of ORA on all configurations. The 
key to improving the performance of rational approximations is the automatic variation of 
the order Nq with the condition numbers. In our tests we have simulated adaptability by 
working with several different orders and retained the one corresponding only to a small 
fraction of the inversion error. The so-tuned order is only slightly higher than that obtained 
in an automatic procedure defined at the end of section El 
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^CG 

FIG. 5: The scaling of Nqg with k and Ccg- These are results of computations with configurations 
A (dashed lines), B (dotted lines) and C (full lines). 

The CGA depends on only one parameter eca- For a given value of eca, the corresponding 
errors ei/a and ecw are almost independent of the condition number of the matrix, thanks to 
the relaxed stopping criterion. The price for such a good adaptability is a computing time 
which is 50% higher than ZA for a given accuracy (70% excess if No is small, 20% for large 
No). The price, however, ensures that for all the configurations one guarantees the same 
order of accuracy from a given value of eca and with a predicted value of ecw 

The variation in the number of conjugate gradient iterations, Nqg, as the stopping cri- 
terion, eca, is changed for the three configurations is shown in Figure El The data for the 
ORA are not shown in the figure because they are very similar to those of the ZA. Note that 
the curves for the CGA lie below that for the ZA (despite the shift in ZA as compared to 
CGA) , which is the influence of the relaxed stopping criterion discussed in Section IVII Ref . 

has devised a similar modification for ZA which can reduce Nqg in that case. Note that 
our above results for ZA did not use any such modifications; using it will further enhance 
the performance of ZA reported above. 

We have noted in Section |n] that the relations ()18|) between the errors are valid for those 
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FIG. 6: The scaling of with ecc in the CGA and ZA for configurations A (dashed hnes) and B 
(full lines). 

approximations to M~^/^ which commute with M. In particular, we noted that for the 
iterative algorithms these relations become valid, provided that e^G is sufficiently small. 
In Figure IHl we demonstrate this for e^, which is expected to be zero when e^c is small 
enough. For the CGA and ZA (data for the ORA are not shown because they almost 
coincide with that for ZA), decreases with eca- The slopes in this plot correspond to 
linear decrease when e^G is sufficiently small. Clear non-linearities are present for larger e^G 
when the condition number is large. We believe that these non-linearities are due to large 
non-converged subspaces, implying a need for high accuracy. 

For fixed order algorithms the adaptability. A, quantifies the configuration dependence 
of speed. The numerical study can be used more directly to illustrate the adaptability by 
studying the slowdown in going from configuration A to B (i.e., from k = 10^ to 4.4 x 10^) 
or from A to C (k changes from 10^ to 3.6 x 10^). As shown in Figure [71 both ZA and CGA 
are adaptable algorithms over a wide range of ei/a- Since ZA is faster, as seen in Figure lU it 
is thus the method of choice. Note, however, that CGA is very comparable to it, and may 
be preferred for its self-tuning ability. 
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FIG. 7: The ratio of CPU time taken at fixed error, ei/2, for (a) configurations B and A (dashed 
Unes) and (b) configurations C and A (full lines) for three different algorithms. 

We emphasise that a fair test of relative performance of algorithms is to work without 
deflation. First, deflation improves the performance of each of the algorithms we have 
investigated. Details are given in the sections on each algorithm. Nevertheless the algorithms 
based on rational approximation seems to be less sensitive to deflation than other ones 
because of the positive shifts introduced in the matrix. Second, since the computation of 
the cigcnsystcm of M, necessary to deflation, is done at finite accuracy, it introduces extra 
errors. If the error in the computation of the eigenvalue Aj is ^j, then the contribution to €^12 
is (5j/Aj. Thus, if we want to achieve a given 61/2, then we must keep < 61/2 |Aj|. When 
the condition number k increases, this criterion becomes impossible to satisfy, leading to 
catastrophic loss of accuracy. 

We feel it worth pointing out that deflation is only one of many possible methods to 
decrease the effective condition number of the problem. Other preconditioning methods 
have not been seriously explored for overlap Fermions. The cost of accurate numerical 
methods seems to suggest that numerically stable preconditioning methods will pay a big 
dividend in this problem. 
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